Context

Business communities in the United States are facing high demand for human resources, but a constant challenge is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a machine-learning-based solution that can help shortlist the candidates with a higher chance of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model:

  1. Facilitate the process of visa approvals.
  2. Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

Import Necessary Libraries

Load the data and check shape/info

Check categorical value counts

EDA

Define functions for visualization

Univariate Analysis

Data Preparation

Bivariate Analysis

Multivariate analysis

Data preprocessing cont'd

Number of Employees

Prevailing Wage

It seems unlikely that someone would apply for a visa with a job that pays below the poverty guideline, especially in the case of more highly educated employees.

Checking the Foreign Labor Certification Data Center Online Wage Library (https://www.flcdatacenter.com/download.aspx) for the data reported in 2016, the prevailing wages are greater than or equal to 7.31, which is higher than the minimum prevailing wage in our data set. So we will assume values less than 7.31 are in error and change them to 7.31. On the other hand, the maximum from the same data was 290690, which is less than this data set's maximum and so we will change larger values to 290690.
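The capping rule described above can be sketched with pandas as follows. This is a minimal illustration on toy data, not the notebook's original cell: the column name `prevailing_wage` and the sample values are assumptions, while the two bounds come from the text.

```python
import pandas as pd

# Bounds quoted in the text above (2016 Online Wage Library figures).
WAGE_FLOOR = 7.31
WAGE_CEILING = 290690

# Toy data; the column name `prevailing_wage` is an assumption.
df = pd.DataFrame({"prevailing_wage": [2.14, 7.31, 52000.0, 319210.27]})

# clip() raises values below the floor to the floor and lowers
# values above the ceiling to the ceiling; values in range are unchanged.
df["prevailing_wage"] = df["prevailing_wage"].clip(
    lower=WAGE_FLOOR, upper=WAGE_CEILING
)
```

Clipping keeps every row in the data set, which is consistent with the decision above to change out-of-range values rather than drop them.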

Based on a 40-hour work week (most of these positions are full-time) and 48 work weeks per year, the prevailing wage for the hourly wage applications does not seem reasonable. Most of the wages seem far too high and the minimum is much too low.

Based on 48 work weeks per year, the prevailing wage for the weekly wage applications does not seem reasonable. Most of the wages seem far too high.

Based on 12 work months per year, the prevailing wage for the monthly wage applications does not seem reasonable. Most of the wages seem far too high.
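One way to run the sanity checks above is to convert every wage to a yearly figure and compare the groups. The sketch below uses the conversion factors stated in the text (40 hours per week, 48 weeks per year, 12 months per year). The column names `unit_of_wage` and `prevailing_wage`, the unit labels, and the toy values are assumptions made for illustration.

```python
import pandas as pd

HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 48
MONTHS_PER_YEAR = 12

# Multiplier that turns a wage in the given unit into a yearly wage.
# The unit labels are assumed; adjust them to match the data.
ANNUAL_FACTOR = {
    "Hour": HOURS_PER_WEEK * WEEKS_PER_YEAR,  # 1920 hours per year
    "Week": WEEKS_PER_YEAR,
    "Month": MONTHS_PER_YEAR,
    "Year": 1,
}

# Toy data with one row per wage unit.
df = pd.DataFrame({
    "unit_of_wage": ["Hour", "Week", "Month", "Year"],
    "prevailing_wage": [25.0, 1000.0, 4000.0, 60000.0],
})

# Map each row's unit to its multiplier and annualize the wage.
df["annualized_wage"] = df["prevailing_wage"] * df["unit_of_wage"].map(ANNUAL_FACTOR)

# Per-unit summary to spot implausible yearly equivalents.
summary = df.groupby("unit_of_wage")["annualized_wage"].agg(["min", "median", "max"])
```

On real data, annualized values that sit far outside the yearly group's range would point to the kind of unreliable entries described above.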

Since applications with hourly, weekly, and monthly wage units selected do not seem to have reliable data for prevailing wage, we will try to fit a model without prevailing wage entirely (keeping unit of wage), and we will try to fit a model with the prevailing wage of just the yearly wage unit.
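The two modeling data sets described above could be built roughly as follows. This is a sketch under assumptions: the column names `unit_of_wage`, `prevailing_wage`, and `case_status`, the unit label `"Year"`, and the toy rows are illustrative, and keeping `unit_of_wage` in the yearly-only variant is a choice made here, not something the text specifies.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
df = pd.DataFrame({
    "unit_of_wage": ["Hour", "Year", "Week", "Year", "Month"],
    "prevailing_wage": [592.2, 83425.65, 1200.0, 122996.86, 5000.0],
    "case_status": ["Denied", "Certified", "Certified", "Denied", "Certified"],
})

# Variant A: all applications, prevailing wage dropped, unit of wage kept.
df_no_wage = df.drop(columns=["prevailing_wage"])

# Variant B: only applications with a yearly wage unit, prevailing wage kept.
df_yearly = df[df["unit_of_wage"] == "Year"].copy()
```

In Variant B, `unit_of_wage` holds a single value, so it carries no information and could be dropped before fitting.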

Dummy variables

Split the dataset

Model evaluation criterion

The model can make wrong predictions as:

  1. Predicting that a visa application should be certified when it should be denied.
  2. Predicting that a visa application should be denied when it should be certified.

Which case is more important?

  1. If the model predicts an application should be certified when it should be denied, the OFLC is allowing a foreigner to enter the country to fill a role that should be filled by US workers, contributing to an increase in unemployment.
  2. If the model predicts an application should be denied when it should be certified, there is a vacancy for an employment position that cannot be filled by US workers due to workforce shortages.
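The two error types above correspond to false positives and false negatives when "Certified" is treated as the positive class. The sketch below counts them on toy labels and derives precision, recall, and F1 from the counts. The labels and the choice of positive class are assumptions for illustration, and the code does not claim which metric the analysis ultimately optimizes.

```python
from collections import Counter

# Toy labels; "Certified" is treated as the positive class (an assumption).
y_true = ["Certified", "Denied", "Certified", "Denied", "Certified", "Denied"]
y_pred = ["Certified", "Certified", "Denied", "Denied", "Certified", "Certified"]
POSITIVE = "Certified"

counts = Counter()
for actual, predicted in zip(y_true, y_pred):
    if predicted == POSITIVE:
        # Predicted certified: correct (tp) or error type 1 (fp).
        counts["tp" if actual == POSITIVE else "fp"] += 1
    else:
        # Predicted denied: error type 2 (fn) or correct (tn).
        counts["fn" if actual == POSITIVE else "tn"] += 1

# Precision falls as error type 1 (false positives) grows.
precision = counts["tp"] / (counts["tp"] + counts["fp"])
# Recall falls as error type 2 (false negatives) grows.
recall = counts["tp"] / (counts["tp"] + counts["fn"])
# F1 is the harmonic mean of the two and penalizes both error types.
f1 = 2 * precision * recall / (precision + recall)
```

If both errors are considered costly, a metric such as F1 that balances them is a natural candidate; if one error matters more, precision or recall alone may be the better target.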

Which metric to optimize?

Define function to evaluate models

Decision Tree Classifier

Feature importance of Decision Tree

Hyperparameter Tuning - DT

Feature importance of Tuned Decision Tree

Random Forest Classifier

Feature importance of Random Forest Classifier

Hyperparameter Tuning - RF

Feature importance of Tuned Random Forest Classifier

Bagging Classifier

Hyperparameter Tuning - BC

AdaBoost Classifier

Feature importance of AdaBoost Classifier

Hyperparameter Tuning - ABC

Feature importance of Tuned AdaBoost Classifier

Gradient Boosting Classifier

Feature importance of Gradient Boost Classifier

Hyperparameter Tuning - GBC

Feature importance of Tuned Gradient Boost Classifier

XGBoost Classifier

Feature importance of XGBoost Classifier

Hyperparameter Tuning - XGB

Feature importance of Tuned XGBoost

Stacking Classifier

Comparing all models

Conclusions